The kernel's assembler startup code is modified so that the other processors can also use it for processor identification and various other low-level configuration. They do not execute those parts of the startup code that would damage the running system, such as clearing the BSS segment.
During the initialisation done by the first processor, the arch/i386/mm/init code is modified to scan the low page, the top page and the BIOS for Intel MP signature blocks. This is necessary because the MP signature blocks must be read and processed before the kernel is allowed to allocate and destroy the page at the top of low memory. Having established the number of processors, it reserves a set of pages to provide a combined stack and boot-up area for each processor in the system. These must be allocated at startup to ensure they fall below the 1MB boundary.
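The signature scan described above can be sketched in user-space C. This is only an illustration, not the kernel's code: the function names mp_fps_valid() and mp_scan() are hypothetical, and the buffer stands in for a mapped region of low memory. It assumes the layout of the MP floating pointer structure from the Intel MultiProcessor Specification: a 16-byte block on a 16-byte boundary that starts with "_MP_", carries a length of 1 (in 16-byte units) at offset 8, and whose bytes sum to zero modulo 256.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Size of the MP floating pointer structure: one 16-byte paragraph. */
#define MP_FPS_SIZE 16

/* Return 1 if the 16 bytes at p look like an MP floating pointer structure.
 * (Hypothetical helper; field offsets follow the Intel MP specification.) */
int mp_fps_valid(const uint8_t *p)
{
    uint8_t sum = 0;

    if (memcmp(p, "_MP_", 4) != 0)      /* signature */
        return 0;
    if (p[8] != 1)                      /* length, in 16-byte units */
        return 0;
    for (size_t i = 0; i < MP_FPS_SIZE; i++)
        sum += p[i];                    /* all bytes must sum to 0 mod 256 */
    return sum == 0;
}

/* Scan a region on 16-byte boundaries.
 * Returns the offset of the first valid structure, or -1 if none is found. */
long mp_scan(const uint8_t *base, size_t len)
{
    assert(base != NULL);
    for (size_t off = 0; off + MP_FPS_SIZE <= len; off += MP_FPS_SIZE)
        if (mp_fps_valid(base + off))
            return (long)off;
    return -1;
}
```

A caller would run mp_scan() once over each candidate region (low page, top page of low memory, BIOS area) and stop at the first hit. The checksum test matters because the four signature bytes alone can occur by chance in unrelated data.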
Further processors are started up in smp_boot_cpus() by programming the APIC registers and sending an inter-processor interrupt (IPI) to the processor. This message causes the target processor to begin executing code at the start of a page of memory, which may be any page within the lowest 1MB, in 16-bit real mode. The kernel uses the single page it allocated for each processor as that processor's stack. Before booting a given CPU, the relocatable code from trampoline.S and trampoline32.S is copied to the bottom of its stack page and used as the target for the startup.
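The restriction to a page start below 1MB follows from how the startup IPI encodes its target: it carries an 8-bit vector VV, and the target processor begins in real mode at physical address 0xVV000 with CS = 0xVV00 and IP = 0. The arithmetic can be sketched as follows. The function names are hypothetical and chosen for illustration only; nothing here touches real APIC registers.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE   4096u
#define LOW_MEM_TOP 0x100000u   /* 1MB: the limit of real-mode addressing here */

/* Startup IPI vector for a page-aligned physical address below 1MB.
 * (Hypothetical helper: the vector is simply the page number.) */
uint8_t startup_vector(uint32_t phys)
{
    assert(phys < LOW_MEM_TOP && phys % PAGE_SIZE == 0);
    return (uint8_t)(phys >> 12);
}

/* Real-mode code segment the target starts with: CS = vector << 8, IP = 0. */
uint16_t startup_cs(uint8_t vector)
{
    return (uint16_t)((uint16_t)vector << 8);
}

/* Linear base address of a real-mode segment (segment value times 16). */
uint32_t segment_base(uint16_t seg)
{
    return (uint32_t)seg << 4;
}
```

For a page at 0x9F000, for example, the vector is 0x9F and the starting code segment is 0x9F00, whose base is again 0x9F000. Because the code segment base equals the page address, code running on the target can recover the address of its own page from CS alone.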
The trampoline code calculates the desired stack base from the code segment (since the code segment on startup is the bottom of the stack), enters 32-bit mode and jumps to the kernel entry assembler. This, as described above, is modified to execute only the parts necessary for each processor, and then to enter start_kernel(). On entering the kernel, the processor initialises its trap and interrupt handlers before entering smp_callin(), where it reports its status and sets a flag that causes the boot processor to continue and look for further processors. The processor then spins until smp_commence() is invoked.
Once every processor has been started, the smp_commence() function flips a flag. Each processor spinning in smp_callin() then loads the task register with the task state segment (TSS) of its idle thread, as is needed for task switching.
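The two-flag handshake between the boot processor and the secondary processors can be modelled in portable C11. This is a sketch under stated assumptions, not the kernel's implementation: all names below are hypothetical, C11 atomics stand in for the kernel's own primitives, and the bounded polling loop on the boot side is an added assumption so that a processor that never reports in does not hang the caller.

```c
#include <assert.h>
#include <stdatomic.h>

#define MAX_CPUS 32                       /* hypothetical limit for this sketch */

static atomic_int cpu_called_in[MAX_CPUS]; /* set by each secondary on arrival */
static atomic_int commenced;               /* set once by the boot processor */

/* Secondary side: announce arrival (the role smp_callin() plays in the text). */
void report_in(int cpu)
{
    assert(cpu >= 0 && cpu < MAX_CPUS);
    atomic_store(&cpu_called_in[cpu], 1);
}

/* Boot side: poll up to max_spins times for a secondary to report in.
 * Returns 1 if it did, 0 on timeout (the timeout is an assumption). */
int wait_for_callin(int cpu, long max_spins)
{
    assert(cpu >= 0 && cpu < MAX_CPUS);
    for (long i = 0; i < max_spins; i++)
        if (atomic_load(&cpu_called_in[cpu]))
            return 1;
    return 0;
}

/* Boot side: release every waiting secondary (the role of smp_commence()). */
void commence(void)
{
    atomic_store(&commenced, 1);
}

/* Secondary side: spin until the boot processor has called commence(). */
void wait_for_commence(void)
{
    while (!atomic_load(&commenced))
        ;                                 /* busy-wait, as in the text */
}
```

In this model each secondary would call report_in() followed by wait_for_commence(), while the boot processor calls wait_for_callin() after starting each one and commence() once all have been started. Only after wait_for_commence() returns would a secondary go on to the per-processor setup, such as loading its task register.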